\part{Introduction}
\chapter{Introduction}
\par
The {\bf SPOOLES} package is used to solve two types of 
real or complex linear systems:
\begin{itemize}
\item
$AX = Y$ or $(A + \sigma B)X = Y$ where $A$ and B are square.
$A$ and $B$ can be real or complex,
symmetric, Hermitian or nonsymmetric.
The factorization can proceed with or without pivoting for
numerical stability.
The factor matrices can be stored with or without dropping small
entries.
\item
Minimize $\|AX_{*,j} - Y_{*,j}\|_2$ for each column of the solution
matrix $X$ and right hand side matrix $Y$.
This is done by computing a $QR$ factorization of $A$ and then
solving $R^T R X = A^T Y$ or $R^H R X = A^H Y$.
\end{itemize}
In both cases, the linear systems can be permuted to reduce the
fill in the factor matrices.
\par
The {\bf SPOOLES} software is written in an object oriented fashion
in the C language.
Parts of the software run in serial mode, 
multithreading using Solaris or POSIX threads, and with MPI.
\par
The software objects are naturally partitioned into three families of
objects.
\par
\begin{center}
\begin{tabular}{|l|l|} 
\multicolumn{2}{c}{\bf Utility objects} \\ \hline
{\tt A2} & dense two dimensional array \\
{\tt Coords} & object to hold coordinates in any number of dimensions \\
{\tt DV} & double precision vector \\
{\tt Drand} & random number generator \\
{\tt I2Ohash} & hash table for the factor submatrices \\
{\tt IIheap} & simple heap object \\
{\tt IV} & int vector \\
{\tt IVL} & int list object \\
{\tt Ideq} & simple dequeue object \\
{\tt Lock} & abstract mutual exclusion lock \\
{\tt Perm} & permutation vector object \\
{\tt Utilities} & various vector and linked list utility methods \\
{\tt ZV} & double precision complex vector \\
\hline
\end{tabular}
\end{center}
\par
\begin{center}
\begin{tabular}{|l|l|} 
\multicolumn{2}{c}{\bf Ordering objects} \\ \hline
{\tt BKL} & Block Kernihan-Lin algorithm object \\
{\tt BPG} & bipartite graph object \\
{\tt DSTree} & domain/separator tree object \\
{\tt EGraph} & element graph object \\
{\tt ETree} & front tree object \\
{\tt GPart} & graph partitioning algorithm object \\
{\tt Graph} & graph object \\
{\tt MSMD} & multi-stage minimum degree algorithm object \\
{\tt Network} & network object for solving max flow problems \\
{\tt SolveMap} & map of submatrices to processes for solves \\
{\tt Tree} & tree object \\
\hline
\end{tabular}
\end{center}
\par
\begin{center}
\begin{tabular}{|l|l|} 
\multicolumn{2}{c}{\bf Numeric objects} \\ \hline
{\tt Chv} & block chevron object for fronts \\
{\tt ChvList} & object to hold lists of {\tt Chv} objects \\
{\tt ChvManager} & object to manager instances of {\tt Chv} objects \\
{\tt DenseMtx} & dense matrix object \\
{\tt FrontMtx} & front matrix object \\
{\tt ILUMtx} & simple preconditioner matrix object \\
{\tt InpMtx} & sparse matrix object \\
{\tt Iter} & Krylov methods for iterative solves \\
{\tt PatchAndGoInfo} & 
    modified factors in the presence of zero or small pivots \\
{\tt Pencil} & object to contain $A + \sigma B$ \\
{\tt SemiImplMtx} & semi-implicit factorization matrix object \\
{\tt SubMtx} & object for dense or sparse submatrices \\
{\tt SubMtxList} & object to hold lists of {\tt SubMtx} objects \\
{\tt SubMtxManager} & 
       object to manager instances of {\tt SubMtx} objects \\
{\tt SymbFac} & algorithm object to compute a symbolic factorization \\
\hline
\end{tabular}
\end{center}
The {\tt MT} directory contains all the multithreaded methods 
and drivers programs.
The {\tt MPI} directory contains all the {\tt MPI} methods and drivers.
The {\tt misc} directory contains miscellaneous methods and drivers.
\par
Each of the following objects that hold numeric entries ---
{\tt A2}, {\tt Chv}, {\tt DenseMtx}, {\tt FrontMtx}, 
{\tt ILUMtx},
{\tt InpMtx},
{\tt Pencil},
{\tt SemiImplMtx} and 
{\tt SubMtx} --- can hold real or complex entries.
An object knows its {\tt type},
{\tt 1} for real (define'd constant {\tt SPOOLES\_REAL})
or
{\tt 2} for complex (define'd constant {\tt SPOOLES\_COMPLEX}).
Since C does not yet have a standard structure for complex numbers,
we have followed the FORTRAN convention of storing the real and
imaginary parts of a complex number in consecutive memory locations.
Internally, we unroll the complex arithmetic into real arithmetic.
The user need not be burdened by this process if (s)he uses the
input/output methods for the different object.
For example, 
{\tt DenseMtx\_setRealEntry()} sets an entry of a real dense matrix, 
while {\tt DenseMtx\_setComplexEntry()} sets an entry of a complex
dense matrix. 
\par
All the heavily used computational tasks have been expanded where
possible into BLAS2 or BLAS3 kernels, for both the real and complex
cases.
There are a multitude of driver programs that test the
functionality of the objects.
A common output of a driver program is a file that can be input
into Matlab to check the errors of the computations.
This convention inspires confidence in the correctness of the
kernel computations.
\par
\section{Software Design}
\label{chapter:softwareDesign}
\par
The {\bf SPOOLES} library is written in the C language and uses
object oriented design.
There are some routines that manipulate native C data types such as
vectors, but the vast bulk of the code is centered around objects,
data objects and algorithm objects.
By necessity, the implementation of an object is through the C {\tt
struct} data type.
We use the following naming convention --- a method (i.e., function)
associated with an object of type {\tt Object} has the form

\centerline{{\it (return value type)}
{\tt Object\_}{\it methodName}{\tt (Object * obj}, $\ldots${\tt )};}

The method's name begins with the name of the object it is
associated with and the first parameter in the calling sequence is
a pointer to the instance of the object. 
Virtually the only exception to this rule is the {\it constructor} 
method.

\centerline{\tt Object * Object\_new(void) ;}

\noindent 
Two objects, the {\tt Chv} and {\tt DenseMtx} objects, have
methods that return the number of bytes needed to hold their data,
e.g.,

\centerline{\tt 
int Chv\_nbytesNeeded(int nD, int nL, int nU, int type, int symflag) ;}
\par
Scan the directory structure of the source code and you will notice
a number of subdirectories --- each deals with an object.
For example, 
the {\tt Graph} directory holds code and documentation for an
object that represents a graph:
its {\tt doc} subdirectory holds \LaTeX files with documentation;
its {\tt src} subdirectory holds C files that contain methods
associated with the object ;
% (each method is a C function of the form {\tt Graph\_*}); 
and its {\tt driver} subdirectory holds driver programs to test or
validate some behavior of the object.
\par
The directory structure is fairly flat --- no object directory
contains another --- because the C language does not support
inheritance.
This can be inelegant at times.
For example, a bipartite graph (a {\tt BPG} object) 
{\it is--a} graph (a {\tt Graph} object), but instead of {\tt BPG}
inheriting from {\tt Graph} data fields and methods from {\tt Graph},
we must use the {\it has--a} relation.
A {\tt BPG} object contains a pointer to a {\tt Graph} object
that represents the adjacency structure.
The situation is even more cumbersome for the objects that deal
with trees of one form or another: an elimination tree {\tt ETree}
and a domain/separator tree {\tt DSTree} each contain a pointer to
a generic tree object {\tt Tree} in their structure.
\par
Predecessors to this library were written in C++ and 
Objective-C.\footnote{The knowledgeable reader is encouraged to
peruse the source to discover the prejudices both pro and con
towards these two languages.}
The port to the present C library was painless, almost mechanical.
We expect the port back to C++ and/or Objective-C to be simple.
\par
Objects are one of two types: 
{\it data objects} whose primary function is to store data
and
{\it algorithm objects} whose function is to manipulate some data
object(s) and return new information.
Naturally this distinction can be fuzzy --- algorithm objects have
their own data that may be persistent and data objects can execute
some simple functionality --- but it holds in general.
To be more explicit, data objects have the following properties:
\begin{itemize}
\item
There is a delicate balance between encapsulation and openness.
The C language does not support any private or protected data
fields, so the C {\tt struct} that holds the data for an object
is completely open.
As an example, the {\tt Graph} object has a function to return the
size of and pointers to a vector that contains an adjacency list,
namely
\begin{verbatim}
     void Graph_adjAndSize(Graph *g, int v, int *psize, int **padj)
\end{verbatim}
where the pointers {\tt psize} and {\tt padj} are filled with the
size of the adjacency structure and a pointer to its vector.
One can get this same information by chasing pointers as follows.
\begin{verbatim}
     vsize = g->adjIVL->sizes[v] ;
     vadj  = g->adjIVL->p_ind[v] ;
\end{verbatim}
One can do the latter but we encourage the former.
As an experiment we replaced every instance of 
{\tt Graph\_adjAndSize()} with the appropriate pointer chasing
(and a similar operation for the {\tt IVL} object) and achieved
around a ten per cent reduction in the ordering time.
For a production code, this savings might drive the change in code,
but for our research code we kept the function call.
\item
Persistent storage needs to be supported.
% Currently this means file storage, but in the future we expect to
% need to ``bundle'' a data object into a block of storage to be
% passed or communicated to a different processor.
Each data object has eight different methods to deal with file I/O.
Two methods deal with reading from and writing to a file whose
suffix is associated with the object name, e.g., {\tt *graph\{f,b\}}
for a formatted or binary file to hold a {\tt Graph} object.
Four methods deal with reading and writing objects from and to a
file that is already opened and positioned, necessary for composite
objects (e.g., a {\tt Graph} object contains an {\tt IVL} object).
Two methods deal with writing the objects to a formatted file to be
examined by the user.
We strongly encourage any new data object added to the library to 
supply this functionality.
\item
Some data objects need to have compact storage requirements.
Two examples are our {\tt Chv} and {\tt SubMtx} objects.
Both objects need to be communicated between processes
in the MPI implementation, the former during the factorization,
the latter during the solve.
Each has a workspace buffer that contains all the information
needed to {\it regenerate} the object upon reception by another
process.
\item
By and large, data objects have simple methods.
A {\tt Graph} object does {\bf not} have methods to find a good
bisector; this is a sufficiently sophisticated function that it
should be implemented by an algorithm object.
The major exception to this rule is that our {\tt FrontMtx} object
{\it contains} the factorization data but also {\it performs} the
factorization, forward and backsolves.
In the future we intend to separate these two functionalities.
For example, one can implement an alternative forward and backsolve
by using methods to {\it access} the factor data stored in the {\tt
FrontMtx} object.
As a second example, massive changes to the storage format,
e.g., in an out-of-core implementation, can be encapsulated in the
access methods for the data, and any changes to the factorization
or solve functions could be minimal.
\end{itemize}
Algorithm objects have these properties.
\begin{itemize}
\item
Algorithm objects use data objects.
Some data objects are created within an algorithm objects method; 
these are owned by the algorithm object and free'd by that object.
Data objects that are passed to algorithm objects can be queried
or {\it temporarily} changed.
\item
They do not destroy or free data objects that are passed to them.
Any side effects on the data objects should be innocent, e.g.,
when a {\tt Graph} object is passed to the graph partitioning
object ({\tt GPart}) or the multistage minimum degree object
({\tt MSMD}), on return the adjacency lists may not be in the input
order, but they contain the values they had on input.
\item
Algorithm objects should support diagnostic, logfile and debug output.
This convention is not entirely thought out or supported at present. 
The rationale is that an algorithm object should be able to respond
to its environment to a greater degree than a data object.
\end{itemize}
\par
Data and algorithm objects share two common properties.
\begin{itemize}
\item
Each object has four basic methods: to allocate storage for an
object, set the default fields of an object, 
clear the data fields of an object,
and free the storage occupied and owned by an object.
\item
Ownership of data is very rigidly defined.
In most cases,
an object owns all data that is allocated inside one of its
methods, 
and when this does not hold it is very plainly documented.
For example, the bipartite graph object {\tt BPG} has a data field
that points to a {\tt Graph} object.
One of its initialization methods has a {\tt Graph} pointer in its
calling sequence.
The {\tt BPG} object then owns the {\tt Graph} object and when it
is free'd or has its data cleared, the {\tt Graph} object is free'd
by a call to its appropriate method.
\end{itemize}
\par
By and large these conventions hold throughout the library.
There are fuzzy areas and objects still ``under construction''.
Here are two examples.
\begin{itemize}
\item
We have an {\tt IIheap} object that maintains integer 
$\langle$ key, value $\rangle$ pairs in a priority heap.
Normally we think of a heap as a data structure, but another 
perspective is that of a continuously running algorithm that
supports insert, delete and identification of a minimum pair.
\item
Our {\tt BPG} bipartite graph object is a data object, 
but it has a method to find the Dulmage-Mendelsohn decomposition, 
a fairly involved algorithm used to refine a separator of a graph.
At present, we are not willing to create a new algorithm object
just to find the Dulmage-Mendelsohn decomposition, so we leave
this method to the domain of the data object.
The desired functionality, identifying minimal
weight separators for a region of a graph, can be modeled using
max flow techniques from network optimization.
We also provide a {\tt BPG} method that finds this
Dulmage-Mendelsohn decomposition by solving a max flow problem on
a bipartite network.
Both these methods have been superceded by the {\tt Network} object
that contains a method to find a max flow and one or more min-cuts 
of a network (not necessarily bipartite).
\end{itemize}
\par
The {\bf SPOOLES} software library is continuously evolving in an
almost organic fashion.
Growth and change are to be expected, and welcomed, but some
discipline is required to keep the complexity, both code and human
understanding, under control.
The guidelines we have just presented have two purposes:
to let the user and researcher get a better understanding of
the design of the library, and to point out some conventions 
to be used in extending the library.
\par
\section{Changes from Release 1.0}
\label{section:intro:changes-1.0}
\par
There are two major changes from the first release of the {\bf
SPOOLES} package: we now support complex linear systems, and the
storage format of the sparse factor matrices has changed from a
one-dimensional data decomposition to a two-dimensional
decomposition. 
The factors are now submatrix based, and thus allow a parallel
solve to be much faster than in Release 1.0.
\par
In the first release, all numeric objects had a {\tt `D`} as the
leading letter in their name, e.g., {\tt DA2}, {\tt DChv}, etc.
A natural way to implement complex data types would be to write 
``parallel'' objects, e.g., {\tt ZA2}, {\tt ZChv}, etc, as is done
in LINPACK and LAPACK for subroutine names.
However, a {\tt DA2} and {\tt ZA2} object share so much common code
that it is a better decision to combine the real and complex
functionality into one object.
This is even more pronounced for the {\tt FrontMtx} object where
there is virtually no code that is dependent on whether or not the
matrix is real or complex.
\par
Virtually no new work has been done on the ordering objects and methods.
Their algorithms were state of the art two years ago, but a recent
comparison with the {\bf EXTREME} \cite{hr98-msndtalk}
and {\bf METIS} \cite{karypis98metis} packages 
on a large collection
of finite element problems shows that the {\bf SPOOLES} orderings
are still competitive.
\par
The serial, multithreaded and MPI code has been modified to force
greater sharing of code between the environments.
``What'' is done is identical in the three cases.
The multithreaded and MPI codes share the same ``choreography'',
in other words, who does what and how.
The main differences between multithreaded and MPI
are that the data structures are global versus local,
and that explicit message passing is done in the latter.
This common structure of the codes has a nonzero impact on the speed 
and efficiency of the individual codes, but the gains from a common
code base are well worth the cost.
\par
The MPI methods have been extensively reworked since the first
release. 
A number of bugs and logic errors have been detected and fixed.
The code appears to be more robust than the first release.
\par
\section{Changes from Release 2.0}
\label{section:intro:changes-2.0}
\par
Release 2.2 is partly a maintenance release.
Some bugs were found and fixed in the MPI factors and solves.
Some minor new methods were added to the 
{\tt DenseMtx}, {\tt FrontMtx}, {\tt InpMtx}
and {\tt Utilities} directories.
The multithreaded methods and drivers have been removed from the
{\tt FrontMtx} directory and placed in a new {\tt MT} directory,
much like the {\tt MPI} methods have their own directory.
\par
Some new functionality has been added.
\begin{itemize}
\item
There are now multithreaded
and distributed matrix-matrix multiply methods.
See the {\tt MT} and {\tt MPI} directories.
\item
The {\tt FrontMtx} object now supports more robust reporting of
errors encountered during the factorization.
There is one additional parameter in the factorization calling
sequences, an error return that signals that the factorization has
failed.
\item
In response to customer requirements, we have added some
``patch-and-go'' functionality to the sparse $LU$ and $U^TDU$
factorizations without pivoting.
There are applications in optimization and structural analysis
where pivoting is not necessary for stabilty, but where the
location of small or zero pivots on the diagonal is meaningful.
Normally the factorization would be ustable or stop, but special
action is taken, the factors are ``patched'' and the factorization
continues.
\par
There is a new {\tt PatchAndGoInfo} object that encapsulates the
``patch-and-go'' strategy and gathers optional statistics about the
action that was taken during the factorization.
This object is attached to the {\tt FrontMtx} object which passes
it unchanged to the {\tt Chv} object that performs the
factorization of each front.
If the user does not need this functionality, no changes are
necessary to their code, i.e., no calling sequences are affected.
\item
New MPI broadcast methods for the {\tt Graph}, {\tt IVL} and
{\tt ETree} objects have been added to the library.
\item
The {\tt Iter} directory contains the following Krylov accelerators 
for the iterative solution of linear systems:
Block GMRES, BiCGStab, conjugate gradient and transpose-free QMR.
Each is available in both left- and right-preconditioned forms.
The preconditioner that these methods use is a {\tt FrontMtx}
object that contains a drop tolerance approximate factorization.
The {\tt ILUMtx} object contains a simple vector-based drop
tolerance factorization object.
(The {\tt FrontMtx} approximate factorization is submatrix-based in
both its data structures and computational kernels, and supports
pivoting for numerical stability, which the {\tt ILUMtx} object
does not.)
We have not written Krylov methods that use the {\tt ILUMtx}
object, but it would be simple to replace the {\tt FrontMtx}
preconditioner with the {\tt ILUMtx} preconditioner.
\item
The {\tt SemiImplMtx} object contains
a {\it semi-implicit} factorization, 
a technique that can require less storage and solve operations than
the present explicit factorization. It is based on the equation
$$
\left \lbrack \begin{array}{cc}
A_{0,0} & A_{0,1} \cr
A_{1,0} & A_{1,1} 
\end{array} \right \rbrack
=
\left \lbrack \begin{array}{cc}
L_{0,0} & 0 \cr
L_{1,0} & L_{1,1} 
\end{array} \right \rbrack
\left \lbrack \begin{array}{cc}
U_{0,0} & U_{0,1} \cr
 0 & U_{1,1} 
\end{array} \right \rbrack,
=
\left \lbrack \begin{array}{cc}
L_{0,0} & 0 \cr
A_{1,0}U_{0,0}^{-1} & L_{1,1} 
\end{array} \right \rbrack
\left \lbrack \begin{array}{cc}
U_{0,0} & L_{0,0}^{-1}U_{0,1} \cr
 0 & U_{1,1}
\end{array} \right \rbrack.
$$
A solve of $AX = B$ 
with the explicit factorization does the following steps
\begin{itemize}
\item solve $L_{0,0} Y_0 = B_0$
\item solve $L_{1,1} U_{1,1} X_1 = B_1 - L_{1,0} Y_0$
\item solve $U_{0,0} X_0 = Y_0 - U_{0,1} X_1$
\end{itemize}
while an implicit factorization has the following form.
\begin{itemize}
\item solve $L_{0,0} U_{0,0} Z_0 = B_0$
\item solve $L_{1,1} U_{1,1} X_1 = B_1 - A_{1,0} Z_0$
\item solve $L_{0,0} U_{0,0} X_0 = B_0 - A_{0,1} X_1$
\end{itemize}
The difference is that the semi-implicit factorization stores
and computes with 
$A_{1,0}$ and $A_{0,1}$ instead of $L_{1,0}$ and $U_{0,1}$,
(this can be a modest savings in storage and operation count),
and performs two solves with $L_{0,0}$ and $U_{0,0}$
instead of one.
This technique works with either a direct or approximate
factorization of $A$.
The semi-implicit factorization is constructed via a
post-processing of any factorization computed by the {\tt
FrontMtx} object.
\end{itemize}
